Part II - A presentation of research on loan data records from Prosper Loan in the United States

by Ayotunde Jeffers Doherty

Dataset and Investigation Overview

This data set contains loan records from Prosper Loan in the United States. It consists of 113,937 rows and 81 columns, that is, 113,937 recorded loans with 81 variables on each loan, including loan amount, interest rate, current loan status, borrower income, and many others.

This part of the report is structured to move from simple univariate summaries to multivariate relationships. It answers questions such as: whether the monthly loan payment is correlated with the original loan amount; how the loan term is spread across loan statuses; how frequently each category occurs for the term of loan, the borrower's employment status, the year of loan, and the loan status; and whether loans differ depending on the loan term and on how large the original loan amount was. The key insights generated from these questions form the basis of this presentation.

Although the dataframe has 81 features, this study is interested in only a few of them, so the dataframe was reduced to the columns useful for this study. The main features of interest include, but are not limited to, the following: loan status, loan term, employment status, whether the borrower is a homeowner, borrower state, whether income is verifiable, and occupation. A number of supporting features help investigate these features of interest: original loan amount, loan origination date, monthly loan payment, current days of delinquency, stated monthly income, investors, and recommendations. In total, 11 features were pulled together into a new dataframe used as the reference for exploration and analysis.
To analyse the loans by year, the loan origination date column was converted from the object data type to datetime. The year was then extracted from the datetime, and the extracted year column was set as a categorical variable. The loan term values were transformed from the original 12 months, 36 months, and 60 months to short term, medium term, and long term respectively, so that the term behaves better as a categorical variable. The loan status has several values representing past due loans in different ranges of days; these were replaced with a single value named 'past due', regardless of the number of days. The borrower state values were transformed from state abbreviations to full names. The stated monthly income and monthly loan payment variables were converted from float to integer for consistency with the loan amount data type. The occupation column was transformed from the object data type to the categorical data type.
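The cleaning steps described above could look roughly like the following sketch. The column names (`LoanOriginationDate`, `Term`, `LoanStatus`, `StatedMonthlyIncome`, `MonthlyLoanPayment`), the new `LoanYear` column, the sample rows, and the exact spelling of the past due labels are assumptions made for illustration, not taken from the notebook. Rounding before the integer conversion is also an assumption.

```python
import pandas as pd

# Tiny made-up sample; column names and values are assumptions for illustration.
df = pd.DataFrame({
    "LoanOriginationDate": ["2013-05-01 00:00:00", "2012-11-15 00:00:00", "2014-01-20 00:00:00"],
    "Term": [12, 36, 60],
    "LoanStatus": ["Current", "Past Due (1-15 days)", "Past Due (61-90 days)"],
    "StatedMonthlyIncome": [3083.33, 6125.0, 2083.33],
    "MonthlyLoanPayment": [330.43, 318.93, 123.32],
})

# Object -> datetime, then extract the year as a categorical column.
df["LoanOriginationDate"] = pd.to_datetime(df["LoanOriginationDate"])
df["LoanYear"] = df["LoanOriginationDate"].dt.year.astype("category")

# Map the numeric terms (in months) to descriptive categories.
term_labels = {12: "short term", 36: "medium term", 60: "long term"}
df["Term"] = df["Term"].map(term_labels).astype("category")

# Collapse every "Past Due (...)" bucket into a single 'past due' value.
df["LoanStatus"] = df["LoanStatus"].str.replace(r"^Past Due.*$", "past due", regex=True)

# Float -> integer (rounded first; the rounding choice is an assumption).
for col in ["StatedMonthlyIncome", "MonthlyLoanPayment"]:
    df[col] = df[col].round().astype(int)

print(df.dtypes)
```

The state abbreviation mapping is omitted here; it would follow the same `map` pattern as the loan term, with a dictionary from abbreviation to full state name.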

Histogram Distribution of Loan Original Amount

The distribution of the original loan amount is right-skewed, that is, asymmetric with a longer tail on the right. Most of the original loan amounts are clustered on the left side of the histogram. The peak occurs at about 5000 dollars, there are outliers in the range between 32000 dollars and 35000 dollars, and the data spread from about 1000 dollars to 35000 dollars.
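A right skew can also be checked numerically, without reading it off the histogram. The sketch below uses synthetic, log-normally distributed amounts as a stand-in for the real column (the data and the variable names are assumptions); for a right-skewed sample the mean sits above the median and the sample skewness is positive.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the original loan amount column (an assumption,
# not the Prosper data): log-normal values are right-skewed by construction.
rng = np.random.default_rng(0)
amounts = pd.Series(rng.lognormal(mean=9.0, sigma=0.6, size=5000))

mean_amount = amounts.mean()
median_amount = amounts.median()
skewness = amounts.skew()  # sample skewness; > 0 indicates a longer right tail

# Bin counts equivalent to what a histogram would draw.
counts, bin_edges = np.histogram(amounts, bins=20)

print(f"mean={mean_amount:.0f}, median={median_amount:.0f}, skew={skewness:.2f}")
```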

Histogram Distribution of Monthly Loan Payment

The monthly loan payment is also right-skewed, that is, asymmetric with a longer tail on the right. Most of the monthly loan payments are clustered on the left side of the histogram. The peak of the monthly loan payment occurs at about 173 dollars, and the data spread from about zero dollars to 2251 dollars.

Kernel Density Estimate for Loan Original Amount

A kernel density estimate (KDE) of the original loan amount approximates the probability density function of the data points. Densities are useful because they can be used to calculate probabilities. From the visualization below, the probability that a randomly chosen original loan amount falls between 5000 dollars and 12000 dollars can be calculated as the area between the density curve and the x-axis over the interval [5000, 12000].
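The area under a KDE over an interval can be computed directly. The sketch below assumes a Gaussian kernel, for which the area has a closed form: it is the average, over all data points, of the normal CDF difference across the interval. The synthetic data, the helper name `kde_interval_probability`, and the normal-reference bandwidth rule are all assumptions for illustration; the notebook's plotting library may pick a different bandwidth.

```python
from math import erf, sqrt

import numpy as np


def kde_interval_probability(samples, low, high, bandwidth):
    """Area under a Gaussian KDE between low and high (hypothetical helper)."""
    samples = np.asarray(samples, dtype=float)

    def normal_cdf(z):
        return 0.5 * (1.0 + erf(z / sqrt(2.0)))

    # Each data point contributes one Gaussian bump centred on itself;
    # the KDE area is the mean of the per-bump areas over [low, high].
    per_point = [
        normal_cdf((high - x) / bandwidth) - normal_cdf((low - x) / bandwidth)
        for x in samples
    ]
    return float(np.mean(per_point))


# Synthetic stand-in for the original loan amounts (an assumption).
rng = np.random.default_rng(0)
amounts = rng.lognormal(mean=9.0, sigma=0.6, size=2000)

# Normal-reference rule of thumb for the bandwidth (an assumption).
bandwidth = 1.06 * amounts.std() * len(amounts) ** (-1 / 5)

p = kde_interval_probability(amounts, 5000, 12000, bandwidth)
print(f"Estimated P(5000 <= amount <= 12000) = {p:.3f}")
```

The same helper applies to the monthly loan payment by passing that column and the interval [300, 500].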

Kernel Density Estimate for Monthly Loan Payment

A kernel density estimate (KDE) of the monthly loan payment approximates the probability density function of the data points. Densities are useful because they can be used to calculate probabilities. From the visualization below, the probability that a randomly chosen monthly loan payment falls between 300 dollars and 500 dollars can be calculated as the area between the density curve and the x-axis over the interval [300, 500].

Term of Loan Distribution

This slide shows the frequency of each category of the loan term. The visual below shows that medium-term loans (36 months) occur most often, with a count of 87778, representing about 77 percent of all loans. The other 23 percent are distributed between long-term (60 months) and short-term (12 months) loans.
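Category frequencies and their shares can be obtained with `value_counts`. The sketch below uses a ten-row made-up sample (an assumption, not the real distribution) to show the pattern, then checks the 77 percent figure using the two counts stated in this report.

```python
import pandas as pd

# Made-up sample of loan terms, for illustration only.
terms = pd.Series(["medium term"] * 7 + ["long term"] * 2 + ["short term"])

counts = terms.value_counts()                # absolute frequency per category
shares = terms.value_counts(normalize=True)  # relative frequency per category

# Share of medium-term loans using the figures reported above.
medium_share = 87778 / 113937

print(counts)
print(f"medium-term share: {medium_share:.1%}")
```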

Borrower's Employment Status Distribution

This slide shows the frequency of each category of the borrower's employment status. The visual below shows that employed borrowers occur most often, with a count of 69557, while retired borrowers occur least often. In this data, loans were disbursed far more often to working borrowers than to retired individuals.

Loan Distribution by Year

This slide shows the frequency of loans per year. The visual below shows that 2013 had the highest number of loan disbursements, with 34345 loans, followed by 2012 and 2014 in second and third position respectively. The fewest loans were disbursed in 2005, with 22 loans.

Line Graph Depicting Relationship Between Monthly Loan Payment and Loan Original Amount

This slide examines the relationship between two continuous numerical variables: original loan amount and monthly loan payment. The visual below shows a positive correlation between the two: as the original loan amount increases, the monthly loan payment tends to increase as well.

Original Loan Amount Against Current Days of Delinquency Grouped by Loan Term

This slide brings together three variables: two continuous numerical variables (original loan amount and monthly loan payment) and one categorical variable (term). Consistent with the earlier findings, there is a positive relationship between the original loan amount and the monthly loan payment. The data points in the scatterplot below are categorized by term of loan.

Original Loan Amount Against Current Days of Delinquency Grouped by Loan Year

This slide brings together three variables: two continuous numerical variables (original loan amount and monthly loan payment) and one categorical variable (year). Consistent with the earlier findings, there is a positive relationship between the original loan amount and the monthly loan payment. The data points in the scatterplot below are categorized by year of loan.

Original Loan Amount Against Current Days of Delinquency Grouped by Employment Status

This slide brings together three variables: two continuous numerical variables (original loan amount and monthly loan payment) and one categorical variable (employment status). Consistent with the earlier findings, there is a positive relationship between the original loan amount and the monthly loan payment. The data points in the scatterplot below are categorized by employment status.

Correlation Matrix Depicting Relationship Between Variables with Heatmap

The heatmap below plots the correlation matrix of the numerical variables, which measures the linear relationship between each pair of variables. It shows a strong positive correlation between the original loan amount and the monthly loan payment, with a correlation coefficient of 0.93. The stated monthly income and the original loan amount appear to have no correlation.
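A correlation matrix of this kind can be computed with `DataFrame.corr`, which uses the Pearson coefficient by default. The sketch below runs on synthetic data built so that the payment depends on the amount and the income does not; the column names and the data-generating choices are assumptions, so the coefficients will not match the 0.93 reported above.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (an assumption): payment is roughly proportional
# to the amount, while income is generated independently of both.
rng = np.random.default_rng(0)
n = 2000
amount = rng.uniform(1000, 35000, size=n)
payment = amount * 0.035 + rng.normal(0, 50, size=n)
income = rng.normal(5000, 1500, size=n)

loans = pd.DataFrame({
    "LoanOriginalAmount": amount,
    "MonthlyLoanPayment": payment,
    "StatedMonthlyIncome": income,
})

# Pairwise Pearson correlation between the numeric columns.
corr_matrix = loans.corr()
print(corr_matrix.round(2))
```

A heatmap is a colored rendering of this matrix; a plotting call such as seaborn's `heatmap` would typically take `corr_matrix` as its input.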
